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Abstract. Network communities represent mesoscopic structure for understanding 
the organization of real-world networks, where nodes often belong to multiple 
communities and form overlapping community structure in the network. Due to 
non-triviality in finding the exact boundary of such overlapping communities, this 
problem has become challenging, and therefore huge effort has been devoted to detect 
overlapping communities from the network. 

In this paper, we present PVOC (Permanence based Vertex-replication algorithm 
for Overlapping Community detection), a two-stage framework to detect overlapping 
community structure. We build on a novel observation that non-overlapping 
community structure detected by a standard disjoint community detection algorithm 
from a network has high resemblance with its actual overlapping community structure, 
except the overlapping part. Based on this observation, we posit that there is 
perhaps no need of building yet another overlapping community finding algorithm; 
but one can efficiently manipulate the output of any existing disjoint community 
finding algorithm to obtain the required overlapping structure. We propose a new 
post-processing technique that by combining with any existing disjoint community 
detection algorithm, can suitably process each vertex using a new vertex-based metric, 
called permanence, and thereby finds out overlapping candidates with their community 
memberships. Experimental results on both synthetic and large real-world networks 
show that PVOC significantly outperforms six state-of-the-art overlapping community 
detection algorithms in terms of high similarity of the output with the ground-truth 
structure. Thus our framework not only finds meaningful overlapping communities 
from the network, but also allows us to put an end to the constant effort of building 
yet another overlapping community detection algorithm. 


1. Introduction 

One of the most used aspects of social network analysis is to discover and display clusters 
and communities in networks - the dense sub-networks, where there are more links 
internally, than externally. It is easy for the common person to spot dense clusters of 
connection in a small network visualization. However, this is extremely difficult problem 
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to detect such groups from large scale networks. There has been a constant effort since 
last one decade from the researchers of both computer science and physics domains to 
explore such community structure from networks after the pioneering effort of Girvan 
and Newman [1]. Today there are dozens of community detection algorithms that 
can detect the disjoint/non-overlapping community structure from the network using 
different heuristics and frameworks (see [21 13] for the survey). However, in real-world 
scenario, it has been observed that a node can be a part of multiple communities, which 
has eventually led to the idea of overlapping/soft communities [U E] E] [7] . This problem 
is even more harder because of the exponential number of possible solutions. Therefore, a 
new direction of research has been started to detect the overlapping community structure 
from the network (see [8] for the survey). 

The dichotomy between “disjoint” and “overlapping” community detection 
algorithms is unfortunate because it limits the application of each algorithm. If 
a network has overlapping communities, a “disjoint” algorithm cannot find them; 
conversely, if communities are known to be disjoint, a “disjoint” algorithm will generally 
perform better than an “overlapping” algorithm. Therefore, to obtain the actual 
community structure, it is important to choose the right kind of algorithm. Note that 
the question of how to choose the right kind of algorithm is outside the scope of the 
present paper. 

However, we hypothesize that there is perhaps no need to develop yet another 
overlapping community hnding algorithm given the assumption that we have diverse 
and efficient disjoint community detection algorithms in hand. In this paper, we present 
a method to allow any “disjoint” community detection algorithm to be used to detect 
overlapping community structure instead for finding another overlapping community 
detection algorithm. This means that a user wishing to find overlapping communities 
need no longer be forced to use one of the overlapping algorithms that exist, but can also 
choose from the many disjoint community finding algorithms. The proposed framework 
is called as PVOC (Permanence based Vertex-replication algorithm for Overlapping 
Community detection) which is a two-phase framework - in the first step, an efficient 
disjoint community detection algorithm is used to detect the non-overlapping community 
structure from the network; in the second step, each node in the disjoint communities 
is processed appropriately using a new vertex-based metric, called permanence 0 , in 
order to measure the extent of belongingness of a vertex in its own community and 
its attached neighboring communities. If the membership of the vertex in its assigned 
community is similar to that in the neighboring community, we assign the vertex into the 
neighboring community, keeping its original community intact. Thus the post-processing 
step is the fundamental component in PVOC to find out overlapping vertices from the 
non-overlapping structure. 

We compare our framework with six state-of-the-art overlapping community 
detection algorithms on both synthetic and large real-world networks (whose ground- 
truth community structure is available). We observe that PVOC signihcantly 
outperforms other baseline algorithms in terms of high resemblance of the output with 
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the ground-truth structure. Moreover, we show that even if it is scalable, it does not 
compromise the correctness of the output. 

Our paper makes several unique contributions to the state-of-the-art in community 
detection. These include (i) analyzing the real-world community structure and observing 
that the disjoint communities are enough to be processed for discovering overlapping 
community structure, (ii) proposing a new framework by combining existing disjoint 
community detection algorithm along with the post-processing step, (iii) showing the 
accuracy of PVOC in terms of accurately discovering the ground-truth structure. 

The organization of the paper is as follows. In the next section, we provide a brief 
overview of state-of-the-art approaches in overlapping community detection. Section [3] 
provides a brief description of the synthetic and real-world datasets. Following this, 
in Section lU we present a detailed results of our empirical observation followed by 
the description of our proposed framework. Section [5] describes the results of the 
experiments to detect overlapping communities and a comparative analysis with the 
baseline algorithms. The experiments in this paper use a combination of PVOC with 
two existing disjoint community detection algorithms, Louvain nDi and Infomap m- 
Finally, we conclude the paper in Section |6] with some immediate future directions. 

2. Related work 

There has been a class of algorithms for network clustering, which allow nodes belonging 
to more than one community. Palla proposed “CFinder” [12], the seminal and most 
popular method based on clique-percolation technique. However, due to the clique 
requirement and the sparseness of real networks, the communities discovered by CFinder 
are usually of low quality [H] . The idea of partitioning links instead of nodes to discover 
community structure has also been explored [H HSl CEl HTj. 

On the other hand, a set of algorithms utilized local expansion and optimization to 
detect overlapping communities. For instance, Baumes et al. [IH] proposed a two- 
step algorithm “RankRemoval” using a local density function. LFM [19] expands 
communities from a random seed node to form a natural community until a fitness 
function is locally maximal. MONC [20] uses the modihed htness function of LFM 
which allows a single node to be considered a community by itself. OSLOM [5] tests 
the statistical signihcance of a cluster with respect to a global null model (i.e., the 
random graph generated by the configuration model) during community expansion. 
Chen et al. 121 ] proposed selecting a node with maximal node strength based on two 
quantities - belonging degree and the modified modularity. EAGLE [6] and GCE [22] 
nse the agglomerative framework to prodnce overlapping communities. COCD [23] first 
identifies cores and then remaining nodes are attached to cores with which they have 
maximum connections. 

Few fuzzy community detection algorithms have been proposed that quantify the 
strength of association between all pairs of nodes and communities [2l]. Nepusz et 
al. [25] modeled the overlapping community detection as a nonlinear constrained 
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optimization problem which can be solved by simulated annealing methods. Zhang et 
ah [26] proposed an algorithm based on the spectral clustering framework. Due to the 
probabilistic nature, mixture models provide an appropriate framework for overlapping 
community detection [23 |28l[29l [30] . MOSES [31] uses a local optimization scheme in 
which the fitness function is defined based on the observed condition distribution. Zhang 
et ah used Nonnegative Matrix Factorization (NMF) to detect overlapping communities 
when the number of communities and the feature vectors are provided [32l[33]. Ding et 
ah [31] employed the affinity propagation clustering algorithm for overlapping detection. 
Recently, BIGCLAM [35] algorithm is also built on NMF framework. 

The label propagation algorithm has been extended to overlapping community 
detection by allowing a node to have multiple labels. In COPRA [36], each node updates 
its belonging coefficients by averaging the coefficients from all its neighbors at each time 
step in a synchronous fashion. SLPA [ST] |3H] spreads labels between nodes according to 
pairwise interaction rules. A game-theoretic framework is proposed in Chen et ah [39] 
in which a community is associated with a Nash local equilibrium. 

Beside these, CONCA [7] extends CN algorithm [40] by allowing a node to split 
into multiple copies. Zhang et ah im proposed an iterative process that reinforces the 
network topology and propinquity that is interpreted as the probability of a pair of nodes 
belonging to the same community. Istvan et ah [42] proposed an approach focusing 
on centrality-based influence functions. Recently, Copalan and Blei [43] proposed an 
algorithm that naturally interleaves subsampling from the network and updating an 
estimate of its communities. The reader can get more details in a nice survey paper by 
Xie et ah [8]. 

3. Test suite of networks 
3.1. Synthetic networks 

It is necessary to have good benchmarks to both study the behavior of a proposed 
community detection algorithm and to compare the performance across various 
algorithms. In light of this requirement, Lancichinetti et ah [44] introduced LFhfl 
benchmark networks that take into account heterogeneity into degree and community 
size distributions of a network. These distributions are governed by power laws with 
exponents ti and T 2 respectively. To generate overlapping communities On, the fraction 
of overlapping nodes is specihed and each node is assigned to Om (> 1) communities. 
LFR also provides a rich set of parameters to control the network topology, including 
the number of nodes n, the mixing parameter /i, the average degree k, the maximum 
degree kmax, the maximum community size Cmax, and the minimum community size 
Cmin- We vary these parameters depending on the experimental needs. Unless otherwise 
stated, LFR graph is generated with the following configuration: p = 0.2, At=10,000, 
Om=4, On=5%; other parameters being set to their default values. Results shown are 

f http://sites.google.com/site/andrealancichinetti/files 
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the average of 100 runs. 



Figure 1. (Color online) Distribution of the number of community memberships of 
vertices. X-axis shows the number of community memberships of vertices and y-axis 
shows the fraction of vertices with certain number of community memberships. 


Table 1. Properties of the real-world networks used in this experiments. N\ number 
of nodes, C: number of communities, S: average size of a community, Om- average 
number of community memberships per node. 


Networks 

N 

E 

C 

S 

Om 

DBLP 

317,080 

1,049,866 

13,477 

429.79 

2.57 

Amazon 

334,863 

925,872 

151,037 

99.86 

14.83 

Youtube 

1,134,890 

2,987,624 

8,385 

9.75 

10.26 

Orkut 

3,072,441 

117,185,083 

6,288,363 

34.86 

95.93 


3.2. Real-world networks with ground-truth communities 

We use four real-world network^ proposed by Yang and Leskovec [321 Si] whose 
underlying ground-truth community structures are known a priori and whose properties 
are summarized in Table [U Figure [U shows the distribution of the number of 
communities memberships of vertices for the real-world networks. 

DBLP: It is a co-authorship network where nodes represent authors and edges 
connect nodes whose corresponding authors have co-authored in at least one paper. 
Since research communities stem around conferences or journals, the publication venues 
are used as ground-truth communities in DBLP. 

Amazon: It is a Amazon product co-purchasing network where nodes represent 
products and edges connect commonly co-purchased products. Each product (i.e., node) 
belongs to one or more product categories. Each product category is used to define a 
ground-truth community. 

§ http://snap.stanford.edu 
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Figure 2. (Color online) An illustrative example to show the procedure followed in 
our empirical study (NCDA: Non-overlapping Community Detection Algorithm). 


Youtube: In the Youtube social network, users form friendship with each other 
and users can create groups where other users can join. Here, such user-dehned groups 
are considered as ground-truth communities. 

Orkut: Orkut is a free on-line social network where users form friendship with 
each other. Orkut also allows users to form a group where other members can then join. 
Here also such user-dehned groups are considered as ground-truth communities. 

4. Vertex-replication algorithm 

Our proposed algorithm is motivated from an empirical study on the ground-truth 
community structure of both synthetic and real-world networks. In this section, we 
hrst describe the empirical observation and then illustrate a new algorithm that can 
detect overlapping communities from a network with the help of any standard disjoint 
community detection algorithm. 

f.l. Empirical observation 

We empirically study the structure of the ground-truth communities. We speculated 
that if we remove the vertices that are part of multiple communities from the ground- 
truth structure, the rest of the portion, i.e., the community structure composed of only 
non-overlapping vertices can be efficiently captured by the standard disjoint community 
detection algorithm. To verify this intuition, we take all the networks with their ground- 
truth communities and two standard disjoint community detection algorithms, namely 
Louvain|inj and Infomap|lH l46] . Then for each network, we run the following steps: 

I We run each of these algorithms to obtain the disjoint community structure from 
the network. 
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Table 2. Number of communities in the ground-truth structure and that obtained 
from Louvain and Infomap for LFR and real-world networks. Here for the LFR 
network, we consider the following configuration: fV=10,000, /i=0.2, 0^=4, 0„=5%. 
The result of LFR is averaged over 100 runs. 


Networks 

Ground-truth 

Algorithms 

Louvain 

Infomap 

LFR 

582 

468 

501 

DBLP 

8,493 

7,987 

8,145 

Amazon 

151,037 

142,098 

149,876 

Youtube 

8,385 

7,967 

7,132 

Orkut 

288,363 

284,980 

286,791 


II Since we know the ground-truth community structure of the network, we remove 
from the ground-truth those vertices (refer to set Vo) which belong to multiple 
communities. 

III Similarly, we remove the constituent vertices of Vo from the disjoint community 
structure obtained from Step I. This step makes sure that the filtered ground-truth 
community structure and the filtered disjoint community structure obtained from 
the algorithm contain same set of vertices. 

IV Then we compare two community structures obtained from Step II and Step III. 



Figure 3. (Color online) Similarity (in terms of NMI) of the community structure 
obtained from two disjoint community detection algorithms (Louvain and Infomap) 
with the ground-truth structures after excluding the overlapping vertices. Here for 
the LFR network, we consider the following configuration: A^=10,000, /j,= 0.2, Om=4, 
0„=5%. 

A schematic example of the above procedure is shown in Figure [2j Figure [T] shows 
that in this process, we discard nearly 40% of the vertices (on an average) which belong to 
multiple communities for each network. In Table [21 we also report the number of disjoint 
communities obtained from Louvain and Infomap algorithms for both synthetic and 
real-world networks and that present in the ground-truth structure. We use a standard 
validation metric, namely Normalized Mutual Information (NMI) j47] to compare these 
two community structures. Figure |3] shows that the similarity is quite high for all the 
networks; this observation indeed corroborates our earlier speculation. Therefore, we 
hypothesize that a standard disjoint community detection algorithm might be able to 
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find the overlapping communities with a suitable post-processing step. This means 
that a user wishing to find overlapping communities need no longer be forced to use 
any overlapping community hnding algorithm, rather a disjoint community structure 
followed by a post-processing step might produce the expected overlapping community 
structure. In the rest of this section, we shall use this observation to design a suitable 
post-processing technique. 

4 . 2 . Permanence based vertex-replication algorithm 

Through careful inspection mentioned above, we have found that a standard disjoint 
community detection algorithms are quite efficient to detect the non-overlapping part 
of the community structure. However there exist few vertices in the network, which are 
part of multiple communities. We intend to design an efficient algorithm that would 
be able to identify such overlapping vertices with their community memberships. For 
that, we use a vertex-based metric, called permanence, which by virtue of its underlying 
formulation measures how intensely a vertex belongs to its community [9]. Below, we 
present a brief overview of the formulation of permanence. 



Figure 4. Toy example depicting permanence of two vertices u and v. Even if 
vertex v has a large number of external connections than u, all these six external 
connections are distributed equally into three neighboring communities, resulting in 
the external pull proportional to 2; whereas u is attached with 4 external neighbors, 
three of them constitute in one community and the rest is attached with another 
community, resulting in the external pull proportional to 3. On the other hand, v is 
connected to 3 internal neighbors which are further completely connected among each 
other; whereas the neighbors of u are partially connected. This results in high internal 
pull of V as compared to u. These two notions of connectively are considered in the 
formulation of permanence. 


4-2.1. Formulation of permanence In an earlier paper [9], we showed that the extent 
of membership of a vertex to a community depends on the following two factors, (i) 
The hrst factor is the distribution of external connections of the vertex to individual 
communities. A vertex that has equal number of connections to all its external 
communities (e.g., a vertex with total 6 external connections with 2 to each of 3 
neighboring communities) has equal “pull” from each community whereas a vertex 
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with more external connections to one particular community (e.g., a vertex with total 
6 external connections with 1 connection each to two neighboring communities and 4 
connections to the third neighboring community), will experience more “pull” from that 
community due to large number of external connections to it. (ii) The second factor 
is the density of its internal connections. The internal connections of a community are 
generally considered together as a whole. However, how strongly a vertex is connected 
to its internal neighbors can differ. To measure this internal connectedness of a vertex, 
one can compute the clustering coefficient of the vertex with respect to its internal 
neighbors. The higher this internal clustering coefficient, the more tightly the vertex is 
connected to its community. 

Combining these two factors together, we formulated permanence Perm{.) of a 
vertex v as follows: 

= iS)" CM ■ 

where I{v) is the number of internal connections of v, D{v) is the degree of n, 
Emax{v) is the maximum connections of n to a single external community and Cin{v) is 
the clustering coefficient among the internal neighbors of v. An illustrative example is 
shown in Figure 01 

For vertices that do not have any external connections, Perm{v) is considered to be 
equal to the internal clustering coefficient (i.e., Perm{v) = Cin{v)). The maximum value 
of Perm{v) is 1 and is obtained when vertex v is an internal node and part of a clique. 
The lower bound of Perm{v) is close to -1. This is obtained when I{v) D{v), such 
that (^) ~ 0 and Cin{v) = 0. Therefore for every vertex v, —1 < Perm{v) < 1. 

4-2.2. The PVOC algorithm Since permanence can assign a score to each of 
the vertices, we can use it in our post-processing step to identify overlapping 
vertices from the detected disjoint community structure. Subsequently, we develop 
a new algorithm, called PVOC (Permanence based Vertex-replication algorithm for 
Overlapping Community detection) that can combine any existing disjoint community 
detection algorithm with the permanence based vertex-replication for detecting 
overlapping community structure of a network. Algorithm 0] presents the pseudo-code 
of PVOC. 

Given undirected network G{V, E) and a threshold 6, the algorithm works as follows: 

I A standard disjoint community detection algorithm Ad is used to detect non¬ 
overlapping community structure NC from G. 

II A set of vertices Vj, are identified from NC such that each constituent vertex in Vj. 

has at least one connection to any external community. 

Ill For each vertex v in 14, we do the following steps: 

(a) We calculate the sum of permanence of v and its neighbors in their assigned 
communities. 
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Algorithm 1 PVOC: Permanence based vertex-replication algorithm for overlapping 
community detection 

Input: A graph G = (Id, E)] Ad= disjoint community detection algorithm; 9 =threshold 
Output: Detected overlapping communities 

procedure Vertex_replication(G, NC, 9) 

Ve = set of vertices having at least one external neighbor 
for all do u G 14 

= current community of v 
Measure current permanence of v, Op{v) 

Measure current sum of permanences of all neighbor’s of v, On(v) 

Sum-Op = Op(v) + On(v) 

for all do n E N > N is the set of external neighbors of v 

Cn = current community of n 
Remove v from and assign it in 
Measure new permanence of v in community Cn, Efp{v) 

Measure new sum of permanences of all neighbor’s of v, Nn{v) 

SumJVp = Np{v) + Nn{v) 
if (|<S''um_Ap — SumjOpl <= 9) theu 

Assign a replica of v in Cn along with its original presence in C^ 
else 

Remove vertex v from Cn and place it back in C^ 
returu The updated overlapping community structure 

procedure PVOC 

Run Ad on G to obtain disjoint community structure NC 
Call VERTEX_REPLICATION(G,VC,6') 


(b) We remove v from its own community and place it to each of its external 
communities separately. This assignment affects the permanence value of v 
and its immediate neighbors. 

(c) For each external community Cn, we measure the current sum of permanence 
of V (in its new community) and its neighbors. 

(d) If the absolute value of the difference of the permanence values obtained from 
Step III (a) and Step III(c) is less than 9, a replica of v is placed into the new 
community Cn, keeping v in its original community as well; otherwise v is 
assigned back to its original community. This step identihes overlapping nodes 
along with their memberships in different communities. 

(e) The algorithm hnally returns all the vertices with new community membership. 

The threshold 9 controls the extent to which one can relax the condition of 
replicating a vertex into multiple communities. We vary the threshold from 0 to 0.2 
and observe that it produces maximum accuracy at 0.05 (see Figure E]). Therefore, 
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for the rest of the experiment, we keep the value of 9 as 0.05. Note that in the 
permanence-based post-processing step, we only consider those vertices having at least 
one external connection. The rationale behind this assumption is that vertices in the core 
of each community are often considered to be correctly placed by the disjoint community 
detection algorithm, whereas vertices which are placed in the peripheral region of the 
community and are loosely connected to the core of the community have high chance 
to be part of multiple communities. Figure [5] shows an empirical observation where we 
plot the relation between the number of external connections of a vertex in the detected 
disjoint community to the number of overlapping memberships of the vertex in ground- 
truth community. We observe that the correlation is increasing in nature, which indeed 
strengthens our hypothesis. 

The time complexity of measuring the permanence of a vertex takes O(d^), where 
d is the average degree of vertices in the network. In real-world networks, the value of 
d is much lower than log{n), where n is the number of nodes in the network. Therefore, 
the PVOC algorithm mostly depends on the underlying disjoint community detection 
algorithm. 
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Figure 5. (Color online) The relation between the average number of external 
connections of vertices (with the standard deviation) obtained from the output of 
the disjoint community detection algorithm (here we use Louvian) and the number of 
communities a vertex is a part of. 


5. Experiments 

PVOC with two popular disjoint community detection algorithms, namely 
)] and Infomap0 [HI 06]. These are chosen because they are reasonably 
accurate algorithms with the potential to handle large networks, and implementation of 
them, by their authors, are publicly available. 


We combine 
Louvain[JJ] u 


II https://sites.google.com/site/findcommunities/ 
^ http://www.tp.umu.se/~rosvall/code.html 
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5.1. Baseline algorithms 

We compare the performance of PVOC with the following state-of-the-art overlapping 
community detection algorithms, whose codes are also available: 

• Order statistics local optimization method (OSLOM): It is based on the local 
optimization of a htness function expressing the statistical signihcance of clusters 
with respect to random fluctuations, which is estimated with tools of Extreme and 
Order Statistics [5]. The code is available at http://www.osloni.org. 

• Community overlap propagation algorithm (COPRA): This algorithm is based 
on the label propagation technique of Raghavan et al [1], but is able to detect 
communities that overlap. Like the original algorithm, vertices have labels that 
propagate between neighboring vertices so that members of a community reach 
a consensus on their community membership [36]. The code is available at 
http://www.cs.bris.ac.uk/~steve/networks/software/copra.html, 

• Speaker listener propagation algorithm (SLPA): The algorithm is an extension 
of the Label Propagation Algorithm (LPA) [1]. In SLPA, each node can be 
a listener or a speaker. The roles are switched depending on whether a node 
serves as an information provider or information consumer. Typically, a node 
can hold as many labels as it likes, depending on what it has experienced 
in the stochastic processes driven by the underlying network structure. A 
node accumulates knowledge of repeatedly observed labels instead of erasing 
all but one of them. Moreover, the more a node observes a label, the more 
likely it will spread this label to other nodes |37|. The code is available at 
https://sites.google.com/site/communitydetectionslpa, 

• Agglomerative hierarchical clustering based on maximal clique (EAGLE): It uses 
the agglomerative framework to produce a dendrogram. First, all maximal 
cliques are found and made to be the initial communities. Then, the pair 
of communities with maximum similarity is merged. The optimal cut on the 
dendrogram is determined by the extended modularity with a weight based 
on the number of overlapping memberships [6|. The code is available at 
http://code.google.com/p/eaglepp/, 

• Cluster-overlap Newman Griven algorithm (CONGA): The idea of this algorithm 
is similar to our idea of finding overlapping communities from disjoint community 
structure. CONGA is based on Griven Newman’s “GN” algorithm [1] but 
extended to detect overlapping communities. CONGA adds to the GN algorithm 
the ability to split vertices between communities, based on the new concept of 
split betweenness. At first, edge betweenness of edges and split betweenness 
of vertices are calculated. Then an edge with maximum edge betweenness is 
removed or a vertex with maximum split betweenness is split. After this step, 
edge betweenness and split betweenness are recalculated. The above steps are 
repeated until no edges remain [7|. However, the calculation of edge betweenness 
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and split betweenness is expensive on large networks. The code is available at 
http://www.cs.bris.ac.uk/~steve/networks/congapaper/, 

• Cluster affiliation model for big networks (BIGCLAM); In this algorithm, 
communities arise due to shared community affiliations of nodes. Here the affiliation 
strength is explicitly modeled for each node to each community. Then each node¬ 
community pair is assigned a nonnegative latent factor which represents the degree 
of membership of a node to the community. The probability of an edge between a 
pair of nodes is then modeled in the network as a function of the shared community 
affiliations [35]. The code is available at http://snap.stanford.edu, 

Note that each algorithm is simply used with its default parameters. 




e 0 


Figure 6. (Color online) Accuracy of PVOC (in terms of three validation metrics) with 
the increase of 9 for LFR [y = 0.3) and one real-world network (DBLP). Maximum 
accuracy is obtained at 0 = 0.05, which we use in rest of the experiment. Each point 
in the plot is an average of the accuracies obtained from Louvain and Infomap. 


5.2. Validation metrics 


A stronger test of the correctness of the community detection algorithm, however, is by 
comparing the obtained community with a given ground-truth structure. For evaluation, 
we use three metrics that quantify the level of correspondence between the detected and 
the ground- truth communities [35] . 


Overlapping Normalized Mutual InformationEI 


• Omega Index [2l] 

• Average F score [19] 


Note that all the metrics are bounded between 0 (no matching) and 1 (perfect 
matching). 


https://github.com/aaronmcdaid/Overlapping-NMI 
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Figure 7. (Color online) Accuracy of all the competing algorithms for LFR by varying 
y (top panel, where N = 10,000, = 4, 0„ = 5%), (middle point, where 

N = 10,000, y = 0.1, On = 5%) and (bottom panel, where N = 10,000, y = 0.1, 
Om = 4)). Note that the value of 0„ is expressed in % of n (LVN: Louvain, INFO: 
Infomap). 


5.3. Experimental results 

In this experiment, we use PVOC combined with Louvain and Informap separately, and 
compare the results with six baseline algorithms. First, we check the dependency of 
PVOC with the value of 6. Figure [6] shows that at 6* = 0.05, PVOC achieves maximum 
accuracy for LFR and one representative real-world network; however the result is almost 
same of other networks. Therefore, we use 0 = 0.05 in the rest of the experiments. One 
can tune 9 appropriately to control the extent of overlapping membership of vertices in 
the network. 

In Figure 0 we compare the outputs obtained from different competing algorithms 
with the ground-truth communities for LFR networks with different parameter settings. 
Figure [7] (top panel) shows the results for different values of p ranging from 0.1 to 0.5. 
As p, increases, the community structure becomes less evident and it becomes difficult 
for all the algorithms to discover the actual community structure. OSLOM performs 
worst compared to the other algorithms. However, for all the cases, PVOC-I-LVN is least 
affected and outperforms other algorithms. This is followed by PVOC-I-INFO, CONGO 
and SLPA. 

We then vary the average number of community memberships per vertex, from 2 
to 8 keeping the other parameters same, and plot the performance of different algorithms 
in Figure [7] (middle panel). The effect is reasonably less on the accuracy of the competing 
algorithms. Here we observe that the pattern is almost similar for PVOC-I-LVN and 
PVOC-blNFO, and are much superior than others. 

Finally, in Figure [7] (lower panel) we plot the accuracy of the algorithms with 
the increasing value of On, percentage of overlapping vertices. Surprisingly, OSLOM 
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shows an unexpected behavior with the increasing accuracy after a certain value of On- 
However, on an average the change in accuracy is almost consistent for all the algorithms 
in all possibilities of On- 

To understand the utility of including PVOC step with the disjoint community 
hnding algorithms in more details, we further measure the performance of Louvain 
and Infomap in isolation without PVOC step. We observe that excluding PVOC step 
signihcantly deteriorates the performance of Louvain algorithm: for LFR network {N 
= 10,000, Om = 4:, On = 5% , /x=0.2) ONMI (0.569), Omega Index (0.512), F-Score 
(0.523); for DBLP network ONMI (0.495), Omega Index (0.521), F-Score (0.487); for 
Amazon network ONMI (0.458), Omega Index (0.498), F-Score (0.447); for Youtube 
network ONMI (0.512), Omega Index (0.522), F-Score (0.564); and for Orkut network 
ONMI (0.526), Omega Index (0.556), F-Score (0.544). Similar trend is observed for 
Infomap algorithm. This observation therefore strengthens the need of PVOC as a 
post-processing step with the disjoint community detection algorithms. 

Now, we run the competing algorithms on the real-world networks. As noted in [35], 
most of the baseline community detection algorithms do not scale for networks of large 
size. Therefore, we use the following technique proposed by Yan and Leskovec [35] to 
obtain several small subnetworks with overlapping community structure from the large 
real networks. We pick a random node u in the given graph G that belongs to at least two 
communities. We then take the subnetwork to be the induced subgraph of G consisting 
of all the nodes that share at least one ground-truth community membership with u- 
In our experiments, we created 500 different subnetworks for each of the six real-world 
datasets and the results are averaged over these 500 samples. For each validation metric 
(ONMI, O Index, F-Score), we separately scale the scores of the methods so that the 
best performing community detection method has the score of 1. Finally, we compute 
the composite performance by summing up three normalized scores. If a method 
outperforms all the other methods in all the scores, then its composite performance 
is 3. 

Figure IH] displays the composite performance of the methods for different networks. 
On an average, the composite performance of PVOC-I-INFO (2.88) and PVOC-I-LVN 
(2.74) significantly outperform other competing algorithms: 6.27% higher than that 
of BIGCLAM (2.71), 18.03% higher than that of SLPA (2.44), 101.3% higher than 
that of OSLOM (1.43), 36.4% higher than that of COPRA (2.11), 48.4% higher than 
that of CONGA (1.94), and 77.8% higher than that of EAGLE (1.62). The absolute 
average ONMI of PVOC-I-INFO (PVOC-I-LVN) for one LFR and six real networks taken 
together is 0.85 (0.83), which is 4.93% (2.46%) and 26.8% (20.8%) higher than the two 
most competing algorithms, i.e., BIGCLAM (0.81), and SLPA (0.67) respectively. In 
terms of absolute values of scores, PVOC-I-INFO (PVOC-I-LVN) achieves the average 
F-Score of 0.84 (0.79) and average 0 Index of 0.83 (0.82). Overall, PVOC combined with 
Louvain and Infomap gives the best results, followed by BIGCLAM, SLPA, COPRA, 
CONGO, EAGLE and OSLOM. 

As most of the baseline algorithms except BIGCLAM do not scale for large real 
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Figure 8. (Color online) Performance of various competing algorithms to detect the 
ground-truth communities. For each evaluation metric separately we scale the score of 
the methods so that the best performing community detection algorithm achieves the 
score of 1. Thus if an algorithm outperforms all the methods in all the scores, then its 
composite score would become 3. 


networks [35], we separately compare PVOC with BIGCLAM (which is scalable and also 
the most competing algorithm) on actual large real datasets. Table |3]shows performance 
of PVOC and BIGCLAM for different real networks. On average, PVOC+INFO 
(PVOC+LVN) achieves 4.28% (5.63%) higher ONMI, 1.48% (2.85%) higher O Index, 
and 6.94% (5.63%) higher F-Score. Overall, PVOC outperforms BIGCLAM in every 
measure and for every network. The absolute values of the scores of PVOC+INFO and 
PVOC+LVN averaged over all the networks are 0.70 and 0.71 (ONMI), 0.69 and 0.70 
(O Index), and 0.72 and 0.71 (F-Score) respectively. 


Table 3. The performance of BIGCLAM and PVOC on large real-world networks. 


Networks 

BIGCLAM 

PVOC+LVN 

PVOC+INFO 

ONMI 

Omega 

F Score 

ONMI 

Omega 

F Score 

ONMI 

Omega 

F Score 

DBLP 

0.61 

0.59 

0.54 

0.65 

0.61 

0.60 

0.65 

0.62 

0.59 

Amazon 

0.73 

0.69 

0.74 

0.72 

0.71 

0.75 

0.73 

0.74 

0.76 

Orkut 

0.65 

0.68 

0.64 

0.72 

0.70 

0.76 

0.73 

0.72 

0.77 

Youtube 

0.68 

0.76 

0.78 

0.77 

0.78 

0.72 

0.71 

0.68 

0.78 


Many optimization algorithms have the tendency to underestimate smaller size 
communities [50] and sometimes tend to produce very large size communities. In our 
test suite, we observe the similar tendency in BIGCLAM whereas the communities 
obtained by PVOC based algorithms are comparable in size with respect to the ground- 
truth. Earlier in Table |2l we have mentioned the number of communities detected 
by PVOC based algorithms (the number of communities does not change due to the 
inclusion of PVOC step with Louvain and Infomap). In Table H] we show for both LFR 
and real-world networks that the size of the largest and smallest communities detected 
by BIGCLAM is much larger than that present in the ground-truth structure. We also 
measure the similarity (using Jaccard coefficient) between the largest and smallest-size 
communities detected by BIGCLAM and PVOC based algorithms with the communities 
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in ground-truth structure and notice that PVOC based algorithms are able to detect 
both largest and smallest-size communities which are most similar to the ground-truth 
structure. Therefore, we hypothesize that our algorithm has the potentiality to produce 
meaningful communities which have high resemblance with the ground-truth structure. 


Table 4. Size of the largest and smallest communities present in the ground-truth and 
that obtained from BIGCLAM and PVOC based algorithms (the Jaccard similarities 
between the results obtained from the algorithms with the ground-truth structure are 
reported within parenthesis) for both LFR and real-world networks. 


Networks 

Ground-truth 

BIGCLAM 

PVOC+LVN 

PVOC+INFO 

Max Size 

Min size 

Max Size 

Min size 

Max Size 

Min size 

Max Size 

Min size 

DBLP 

3,458 

124 

9,876 (0.56) 

877 (0.48) 

4,098 (0.71) 

243 (0.82) 

4,143 (0.76) 

204 (0.81) 

Amazon 

5,987 

245 

10,109 (0.45) 

765 (0.57) 

6,876 (0.69) 

398 (0.75) 

6,367 (0.72) 

323 (0.83) 

Orkut 

10,687 

1,876 

13,768 (0.72) 

2,985 (0.69) 

11,976 (0.74) 

1,908 (0.79) 

11,345 (0.75) 

1,976 (0.79) 

Youtube 

8,987 

765 

9,976 (0.65) 

1,098 (0.62) 

8,876 (0.76) 

987 (0.71) 

9,018 (0.74) 

865 (0.82) 


6. Conclusions 

In this paper, we presented a study to show that there is perhaps less need of developing 
yet another algorithm for finding overlapping communities from the network. We 
demonstrated how the output of an efficient disjoint community detection algorithm can 
be leveraged to discover the overlapping community structure. For that, we proposed 
a novel, two-phase framework, called PVOC that can be combined with any efficient 
disjoint community detection algorithm. PVOC uses a new metric, called permanence 
in the post-processing step on each vertex and detects the overlapping vertices from 
the non-overlapping structure. We combined PVOC with two efficient and scalable 
algorithms, Louvain and Informap. Experimental results showed that our approach is 
viable in producing meaningful overlapping communities quite efficiently even from the 
large real world networks in terms of high resemblance with the ground-truth community 
structure. PVOC is controlled by only one parameter 6, which can be efficiently tuned 
to increase the extent of overlapping memberships per vertex in a network. 

However, a major drawback of PVOC is that it produces exactly the same number 
of overlapping communities that the disjoint community detection algorithm produces. 
However, it might be possible that due to the overlapping nature of a community, new 
community might emerge from the disjoint community structure. As an immediate step, 
we would like to include a new module in the post-processing step that would consider 
the emergence of new communities. Moreover, we would try to evaluate PVOC in 
conjunction with even more disjoint community detection algorithms. To conclude, 
we would like to emphasize on the fact that considering such a massive literature 
particularly on community detection, it is perhaps the good time to put an end to 
such consistent effort of proposing yet another algorithm, and to revisit some of the 
existing algorithms that are efficient enough to fulfill both the purpose of discovering 
disjoint and overlapping communities from the network. 
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